[spark] support lateral inner join for vector search#8252

Open
Stefanietry wants to merge 1 commit into
apache:master from
Stefanietry:support_lateral_join_for_vector_search
@Stefanietry

Contributor

Purpose
Purpose: Support lateral join for vector search on Spark.
Linked issue: #8251

Tests
Added vector search with lateral join tests in org.apache.paimon.spark.SparkMultimodalITCase#testVector and org.apache.spark.sql.test.SQLTestUtils#test("lateral vector search preserves subquery alias qualifiers").

@Stefanietry Stefanietry force-pushed the support_lateral_join_for_vector_search branch from 8aa9c09 to 774c9b6 Compare June 16, 2026 08:05

override protected def doExecute(): RDD[InternalRow] = {
child.execute().mapPartitions {
outerRows =>

Contributor

Can batch queries be supported? Batch queries are crucial for performance. You can take a look at the benchmark in https://github.com/apache/paimon-vector-index

Contributor Author

Thanks for the reminder. I'll rework it to run in batch mode later.

@JingsongLi

Contributor

Please fix test failures.

@Stefanietry Stefanietry force-pushed the support_lateral_join_for_vector_search branch 2 times, most recently from 4697f65 to c23c76b Compare June 22, 2026 15:04
@Stefanietry Stefanietry reopened this Jun 22, 2026
@Stefanietry Stefanietry force-pushed the support_lateral_join_for_vector_search branch 2 times, most recently from a1e3745 to 835b339 Compare June 23, 2026 05:38
_.toPaimonDataField)).asJava)
}
val sparkRow = SparkInternalRow.create(resultRowType)
val vectorSearchBuilder = innerTable

Contributor

Normal Spark vector_search scans apply pushed partition/data filters before top-k (PaimonBaseScan.evalVectorSearch passes pushedPartitionFilters/pushedDataFilters into the builder). This lateral executor builds the BatchVectorSearchBuilder here without carrying predicates from the search side; PushDownLateralVectorSearchFilter only pushes predicates that reference the left child, so predicates on r.dt or other searched-table columns stay above LateralVectorSearch. A query like ... JOIN LATERAL (...) r WHERE r.dt = '20260420' will pick topK over all partitions and then filter the joined rows, which can return fewer or wrong rows compared with non-lateral vector_search(...) WHERE dt = .... Please preserve search-side filters and apply them via withPartitionFilter/withFilter before newVectorScan()/readBatch(), or reject such predicates explicitly.
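
To make the ordering concern above concrete, here is a minimal sketch in plain Scala. It does not use Paimon or Spark APIs; `Doc`, `TopKFilterOrder`, and the sample rows are hypothetical stand-ins for the searched table, and "distance" is an assumed similarity score where smaller is closer.

```scala
// Hypothetical in-memory stand-in for the searched table; not Paimon API.
final case class Doc(id: Int, dt: String, distance: Double)

object TopKFilterOrder {
  // Sample rows: the two closest matches live in partition 20260419.
  val docs: Seq[Doc] = Seq(
    Doc(1, "20260419", 0.10),
    Doc(2, "20260419", 0.20),
    Doc(3, "20260420", 0.30),
    Doc(4, "20260420", 0.40),
    Doc(5, "20260420", 0.50)
  )

  // Keep the k rows with the smallest distance.
  def topK(rows: Seq[Doc], k: Int): Seq[Doc] =
    rows.sortBy(_.distance).take(k)

  // Predicate stays above the search: top-k over all partitions, then filter.
  def topKThenFilter(k: Int, dt: String): Seq[Doc] =
    topK(docs, k).filter(_.dt == dt)

  // Predicate pushed into the search: filter first, then top-k.
  def filterThenTopK(k: Int, dt: String): Seq[Doc] =
    topK(docs.filter(_.dt == dt), k)
}
```

With `k = 2` and `dt = "20260420"`, `topKThenFilter` returns no rows because both of the global top-2 rows belong to the other partition, while `filterThenTopK` returns the rows with ids 3 and 4. That gap is the "fewer or wrong rows" behavior described above.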


scan.plan().splits().asScala.iterator.flatMap {
split =>
val reader =

Contributor

This reader is only closed when this inner iterator is fully exhausted. If a downstream operator short-circuits consumption, for example LIMIT 1/take, or if the task is interrupted, Spark can stop pulling rows before hasNext returns false, leaving the current PaimonRecordReaderIterator and its underlying RecordReader open. Please register a TaskContext completion listener or wrap the returned iterator so the current reader is closed on task completion/cancellation as well as normal exhaustion.
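
One way to approach the wrapping suggested above is sketched below in plain Scala. `ClosingIterator` is a hypothetical helper, not an existing Paimon or Spark class; it only assumes that the reader can be treated as an `AutoCloseable`. The close is idempotent, so it is safe to trigger it both on normal exhaustion and from a completion callback.

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical wrapper (not part of Paimon or Spark): closes `resource`
// when the iterator is exhausted or when close() is called explicitly.
final class ClosingIterator[T](underlying: Iterator[T], resource: AutoCloseable)
    extends Iterator[T]
    with AutoCloseable {

  private val closed = new AtomicBoolean(false)

  // Idempotent: the resource is closed at most once.
  override def close(): Unit =
    if (closed.compareAndSet(false, true)) resource.close()

  override def hasNext: Boolean =
    if (closed.get()) {
      false
    } else {
      val more = underlying.hasNext
      if (!more) close() // normal exhaustion path
      more
    }

  override def next(): T = {
    if (!hasNext) throw new NoSuchElementException("iterator exhausted or closed")
    underlying.next()
  }
}
```

Inside a Spark task, the same `close()` could additionally be registered with something like `Option(TaskContext.get()).foreach(_.addTaskCompletionListener[Unit](_ => it.close()))`, so that a `LIMIT`-style short circuit or a cancelled task still releases the current reader. How this fits the executor in this PR is an assumption, since the full diff is not shown here.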

@Stefanietry Stefanietry force-pushed the support_lateral_join_for_vector_search branch from 835b339 to 07428ac Compare June 23, 2026 08:00
@Stefanietry Stefanietry force-pushed the support_lateral_join_for_vector_search branch from 07428ac to c9ee0dd Compare June 23, 2026 11:41